Introduction

Lior Blum, Dan Caspi

The main goal of our project is to demonstrate a music recommendation system that is based on audio features and lyrics analysis of tracks.
The system should use machine learning techniques in order to predict which songs each user would probably like.

Such systems are mainly used in music streaming platforms (such as Spotify, Apple Music, YouTube Music, etc.). Music recommendation systems have various benefits:

  1. Increasing users' satisfaction, immersion and engagement
  2. Maximizing the personalization of the platform
  3. Making the platform convenient to use, with no need to waste time finding new songs
  4. Gaining insights into users' preferences to support data-based marketing decisions

Recommendation system types

The two most popular types of recommendation systems are:

The content-based approach relies on the similarity of particular items. While using a streaming music service, a user likes or dislikes songs, creates playlists, or defines his/her favorite songs/genres/artists beforehand. The main idea of a content-based recommendation system is to extract the metadata and/or data of songs that a user loved, compare them with the metadata/data of other songs in the service's library, and, based on this, recommend similar songs to the user.

In turn, a collaborative system is built on the basis of users’ overlapping preferences and ratings of songs. It assumes that if user A and user B express similar preferences, similar songs can be recommended to them, meaning that if user A likes a particular song, it is likely that this song will also appeal to user B, and vice versa.

[Figure: the two most popular types of recommendation systems]

Our Emphasis

Collaborative filtering is widely used, not only in music services, but in shopping, video streaming, and social networks. Its assumption - that people who agreed in the past will agree in the future - holds in many cases in music too. To our knowledge, all major music streaming services nowadays use some kind of collaborative filtering as part of their recommendations. However, it has some major drawbacks (both in general and specifically in music), chiefly the cold-start problem (new songs and new users have no listening history to learn from) and a bias against unpopular or niche music.

Therefore, we decided to focus on content-based filtering, which helps tackle the two issues explained above. In the music field, the 'content' we'll use consists of:

  1. Audio features (key, liveness, tempo, etc.)
  2. Lyrics analysis

A recommendation system based on this kind of content does not suffer as much from cold start: a new song's audio features and lyrics are available the moment it is added, before anyone has listened to it.

Content-based filtering also does not discriminate against unpopular songs/artists or foreign music, since it relies purely on the essence of the music - melody and lyrics. Therefore, it encourages users to broaden their horizons - musically and culturally.

Data Collection

The data used in this project is taken from "Spotify Million Playlist Dataset Challenge" - a continuation of a data science research challenge focused on music recommendation organized by Spotify (See RecSys Challenge 2018).

The project's data consists of:

  1. spotify_million_playlist_dataset (the challenge dataset)
  2. songs_dataset
  3. audio_features_dataset
  4. lyrics_corpus

spotify_million_playlist_dataset

The raw challenge dataset, downloaded from https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files.

songs_dataset.json

All songs from the playlists dataset collected with the following code:
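A minimal sketch of how such a collection step might look, assuming the standard MPD slice-file layout in which each `mpd.slice.*.json` file holds a `playlists` list whose playlists contain `tracks` entries with `track_uri`, `track_name` and `artist_name` fields (the function name is ours):

```python
import json
from pathlib import Path

def collect_unique_tracks(slice_dir):
    """Collect all unique tracks from the MPD slice files in slice_dir."""
    tracks = {}
    for slice_path in sorted(Path(slice_dir).glob("mpd.slice.*.json")):
        with open(slice_path, encoding="utf-8") as f:
            data = json.load(f)
        for playlist in data["playlists"]:
            for track in playlist["tracks"]:
                # Deduplicate by Spotify track URI
                tracks[track["track_uri"]] = {
                    "track_name": track["track_name"],
                    "artist_name": track["artist_name"],
                }
    return tracks
```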

audio_features_dataset.json

A collection of various audio features generated from 'songs_dataset'.

Retrieved from Spotify public API (https://api.spotify.com/).

lyrics_corpus.json

Lyrics collection of many of the songs from the playlists dataset. Scraped from Genius Lyrics site and public API (https://genius.com/).

Before we start analyzing the lyrics, we might need to remove outliers. Outliers for our purpose are songs with lyrics too short (which probably indicate that they were not properly scraped or that they are purely instrumental), or too long (which might cause biased analysis).
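A minimal sketch of such a filter; the thresholds below are placeholders, and the real cut-offs should come from the word-count distribution of the corpus:

```python
def remove_lyric_outliers(lyrics_by_track, min_words=10, max_words=1000):
    """Keep only songs whose lyrics fall inside a plausible word-count range.

    Songs with too few words were probably scraped badly or are purely
    instrumental; songs with too many words could bias the analysis.
    """
    filtered = {}
    for track_id, lyrics in lyrics_by_track.items():
        n_words = len(lyrics.split())
        if min_words <= n_words <= max_words:
            filtered[track_id] = lyrics
    return filtered
```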

NRC Word-Emotion Association Lexicon - Emotion-Intensity-Lexicon-v1

A large lexicon comprised of multiple lexicons (one per emotion) covering thousands of English words, each annotated with its associated emotion and the intensity of that association (on a scale of 0.0-1.0). Used for emotion recognition of songs' lyrics. The annotations were done manually via crowdsourcing.

Exploratory Data Analysis

Lyrics Analysis

Music is (usually) not just a melody, but also lyrics that tell us a story, a feeling, an idea. Lyrics can make people connect with a song, identify with it and fall in love with it. For some people (or in some genres), lyrics aren't that important and only the sound matters. For others, lyrics are everything. In order to recommend songs to people, we should at least try to understand what they are saying.

Analyzing lyrics is no simple task. Unlike audio features, which are fetched as numeric values ready to be processed, lyrics are raw texts filled with less-important words, slang, punctuation and abbreviations. Songs' lyrics also contain metaphors, irony and plenty of sarcasm. Some of these problems can be overcome; others, not so much. Therefore, we do not aim to understand each song's entire implicit meaning and hidden themes. We only try to get a clue of what each song is about, and of which songs are similar in that regard.

We will need to find ways to turn these texts into normalized numeric features that can be compared across songs and clustered.

Considered lyrics analysis techniques:

Document Clustering using Topic Model:

We will focus on two methods of lyrics analysis: Emotion Recognition and Topic Modeling.

Emotion Recognition

Just as in any type of art, music is driven by emotions. Music allows writers, composers and producers to express their emotions through melodies, verses and choruses. It is also what makes listeners like/dislike certain music and feel attached to/detached from it - which is what interests us in this project.

In some songs, lyrics may give us a better clue about their tone and expressed feelings than their melodies, so we theorize that emotion recognition of lyrics can help us predict which songs each user likes.

We could not find any dataset of songs and their emotional values, and we are not planning to manually construct it by ourselves. Therefore, we'll need to find a dataset that can help us score our songs.

Several datasets were considered for this part but dismissed: Emotions dataset for NLP (from Kaggle), Text Emotion by CrowdFlower, ISEAR, Emotion Intensity in Tweets from WASSA 2017, and GoEmotions. These datasets all rely on sentence structure, which is inconvenient for lyrics analysis (it is very hard, and sometimes impossible, to break a song into logical, well-formed sentences); they only give a flat label for each example without a numeric intensity value; and some of them are based on social-media slang, style and symbols, which are usually very different from the prosaic style of songs.

Instead, we will use a rule-based sentiment analysis approach - scoring lyrics on a word basis, using a given dictionary/lexicon and language rules.

The dataset we found most suitable for our purpose was the NRC Word-Emotion Association Lexicon suite, and specifically the Emotion Intensity Lexicon within it - a dictionary/lexicon of words, each with an emotion and a numeric intensity value.

In the Emotion Recognition part, we give each song eight intensity scores, one for each of the basic emotions covered by our emotions lexicon dataset: anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

These scores are given using VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool. But we'll get to that later...

First, let's load all the lexicons:
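A sketch of how the loading could look, assuming one tab-separated file (word, then score) per emotion; the actual NRC files use a different naming scheme, so the paths would need adjusting:

```python
from pathlib import Path

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def load_emotion_lexicons(lexicon_dir):
    """Load one {word: intensity} dict per emotion.

    Assumes one tab-separated file per emotion (word<TAB>score),
    named "<emotion>.txt" -- adjust to the real file layout.
    """
    lexicons = {}
    for emotion in EMOTIONS:
        path = Path(lexicon_dir) / f"{emotion}.txt"
        with open(path, encoding="utf-8") as f:
            lexicons[emotion] = {
                word: float(score)
                for word, score in (line.rstrip("\n").split("\t")
                                    for line in f if line.strip())
            }
    return lexicons
```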

For example: the first (and strongest) entries of the 'joy' lexicon look like:

We want to make sure the emotions lexicons match our lyrics data and do not miss any important words. We can't possibly cover every single word that appears in any song, but we'll take the most common ones.

Out of the (500) most common words in lyrics, we will build a list of all assumed sentimental words which are not present in the emotions lexicons. Since most words are not sentimental, we will only add words that are also present in VADER's lexicon, which contains about 7,000 sentimental words.
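A sketch of how this comparison might be computed; the lexicon arguments here are stand-ins for the loaded NRC per-emotion dictionaries and VADER's word-score dictionary:

```python
import re
from collections import Counter

def missing_sentiment_words(all_lyrics, emotion_lexicons, vader_lexicon, top_n=500):
    """Find common lyric words that VADER's lexicon marks as sentimental
    but that are absent from every emotion lexicon."""
    words = re.findall(r"[a-z']+", " ".join(all_lyrics).lower())
    common = [w for w, _ in Counter(words).most_common(top_n)]
    in_emotions = set().union(*(lex.keys() for lex in emotion_lexicons.values()))
    return [w for w in common if w in vader_lexicon and w not in in_emotions]
```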

Not too bad. We'll manually add the important ones to the lexicon's files:
[Note: full Lemmatization and Stemming before analyzing the songs are not needed and should not be done, since most of the words' forms are already in the lexicon, with a distinct value for each different form]

As stated before, the VADER module contains a default lexicon of words and their sentiments, with which it analyzes texts and gives them positive, negative and neutral intensity scores on a scale of 0.0-1.0. However, this lexicon only covers binary sentiment polarity - positive and negative (neutral means neither) - otherwise we could just use plain VADER.

Instead, we will use VADER with our emotions lexicons in the following way:

Since VADER is a generic model, by replacing its lexicon with our emotions lexicons we can (hopefully) accurately identify songs' general emotions. Emotion intensities are determined not only by the presence and frequency of words from the corresponding lexicon in the text, but also by the presence of booster/negation words in their context (such as very, extremely, barely, not, etc.)
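A heavily simplified sketch of that scoring mechanism (this is not VADER itself, which applies many more rules - punctuation, capitalization, trigram context - and uses different constants; the booster weights below are purely illustrative):

```python
# Illustrative booster/negator words; VADER's real tables are much larger.
BOOSTERS = {"very": 0.3, "extremely": 0.4, "barely": -0.3}
NEGATORS = {"not", "never", "no"}

def emotion_intensity(tokens, lexicon):
    """Toy approximation of the lexicon + rules approach: each lexicon hit
    contributes its intensity, scaled by a booster or zeroed by a negator
    appearing directly before it, then the hits are averaged."""
    hits = []
    for i, token in enumerate(tokens):
        if token not in lexicon:
            continue
        score = lexicon[token]
        prev = tokens[i - 1] if i > 0 else None
        if prev in NEGATORS:
            score = 0.0
        elif prev in BOOSTERS:
            score = min(1.0, score * (1 + BOOSTERS[prev]))
        hits.append(score)
    return sum(hits) / len(hits) if hits else 0.0
```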

Let's copy our lexicons into VADER's directory and create analyzers out of them.

Let's take a look at two songs for example - one very joyful, and the other very sad, according to one of our analyzers.

Note: Whenever we analyze a song, we first of all remove all stopwords, EXCEPT those that are used by VADER for the analysis (negators and boosters).

We can now analyze and score all songs.

We can also visualize the distribution of each emotion and see the differences between them:

Let's visualize all features, one pair at a time, to find out how much (if any) they are correlated:

We can see a moderate correlation in two pairs of features:

We may consider omitting one feature from each pair as a form of dimensionality reduction, if needed.

Limitations

As large and exhaustive as our lexicons are, the entire English vocabulary (even just its relatively common, generally used part) is much larger. Many emotional words will not be detected when using the lexicons alone. Many other words might have multiple meanings and be misunderstood, and others are written differently from their dictionary form (like cryin' instead of crying).

Moreover, the VADER analyzer only scores unigrams and some bigrams: single sentimental words, and sometimes the negators/boosters that go along with them and increase/reduce/negate their intensity. It completely ignores longer phrases and other meaningful word combinations.

Perhaps we can combine deep learning with our analyzers in order to tackle these issues.

Topic Model

Besides emotions, songs' lyrics touch on plenty of subjects. These subjects can be related to sub-emotions, or they can be completely neutral. We want to find the most common and significant topics in our entire songs corpus, and find out which topic(s) each song belongs to.

A topic model is a generative model that intends to discover underlying topics in a collection of documents and each document's assumed closeness to those topics. A popular and well-established topic modeling algorithm is Latent Dirichlet Allocation (LDA), a probabilistic generative model built on the assumption that every document in a corpus is a mixture of latent topics, and that each of these topics is itself a probability distribution over words.

By using the topic model on our songs lyrics, we may be able to identify common topics/themes in our songs and cluster songs based on these topics.

We'll do some necessary preprocessing first.

Filtering non-English songs

Note: a single topic model is suited to a single language. That said, all the tools we use can easily be applied to other languages - they are not specific to English.
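A language-detection library (such as langdetect) is the robust way to do this filtering; as an illustration, even a crude stopword-ratio heuristic separates English lyrics from most others (the marker set below is a tiny hand-picked sample, not part of the project's actual pipeline):

```python
# A tiny sample of high-frequency English function words.
ENGLISH_MARKERS = {"the", "and", "you", "i", "a", "to", "of", "in",
                   "it", "is", "me", "my", "on", "that", "we"}

def looks_english(lyrics, threshold=0.1):
    """Crude heuristic: English lyrics should contain a noticeable share
    of common English function words. A dedicated library such as
    langdetect would be far more robust."""
    words = lyrics.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_MARKERS)
    return hits / len(words) >= threshold
```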

Lyric Tokenization

Token Lemmatization and removing Stop Words
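A sketch of the tokenization and stop-word removal steps; the stop-word list here is a tiny illustrative subset, and the real pipeline would also run a proper lemmatizer (e.g. nltk's WordNetLemmatizer) on each token rather than skipping lemmatization:

```python
import re

# Illustrative subset; a real pipeline would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on",
              "it", "is", "i", "you", "me", "my", "we", "that"}

def tokenize(lyrics):
    """Lowercase the lyrics and split them into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", lyrics.lower())

def preprocess(lyrics):
    """Tokenize and remove stop words and very short tokens. A lemmatizer
    would additionally map each remaining token to its dictionary form."""
    return [t for t in tokenize(lyrics) if t not in STOP_WORDS and len(t) > 2]
```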

Dictionary Creation and occurrence-based Filtering

In order to perform Latent Dirichlet Allocation, we use the popular and well-established Python library gensim, which requires a dictionary representation of the documents: all tokens are mapped to unique IDs, which reduces the overall dimensionality of the corpus. In addition, we filter out tokens that occur in fewer than 60 songs, as well as tokens that occur in more than 80% of songs.

Bag-of-Words and Index to Dictionary Conversion

Each song (as of now a list of tokens) is converted into the bag-of-words format, which only stores the unique token ID and its count for each song.

Choosing LDA implementation

Now our data is finally ready for topic extraction using LDA.

LDA is implemented in two major libraries: Scikit-learn and Gensim. We'll compare their results and runtime with default parameters (no optimization).

Gensim's Topics:

Scikit-learn's Topics:

We can see that the results are quite similar, but the runtime of Gensim is much shorter.

We can summarize the results in the following table, consisting of similar topics between the two according to the visualizations above:

Topic Label(s)           Gensim Topic Num    Scikit-learn Topic Num
Religious, Spiritual     5                   4
Explicit, Violent        1                   2
Light Romance, Passion   2                   5
Pain, Loss, Fear         4                   absent
Nature                   absent              6
Love                     6                   ~5,1
Family                   ~3                  3

It looks like both models need optimization.

Since the results of the two are comparable, we'll continue with Gensim because of its shorter runtime and because it has a built-in evaluation method called Coherence (which we'll expand upon later on).

Determining the number of topics and hyperparameters tuning

One extra step needs to be taken to optimize the results: identifying an optimal number of topics (k), as well as tuning the hyperparameters Document-Topic Density (α) and Word-Topic Density (η).

The LDA model clusters our songs into topics, but it requires the number of topics, k, as a parameter. If k is too small, our topics will be too general, contain too many words, and the songs that share them will not actually be similar. However, a k which is too large will yield fragmented topics, repetition of certain words across many of them, and not enough similar songs for each song.

The alpha parameter determines how many topics are extracted from each song ("document"), and eta determines how many words are included in each topic. Both parameters can also impact our results significantly.

The mathematical measure of how good a topic is is called Topic Coherence: the degree of semantic similarity between the high-scoring words in the topic (detailed information here). There are several measures for calculating it; we chose NPMI over the default C_V for reliability reasons (see discussion here).

We will create multiple LDA models for our data, with different parameter values, compute their topics' average coherence values, and find the combination which gives us the maximum value.

We'll tune k, then alpha, and then eta, each separately:

Now that we have found the model that yields the most coherent topics, let's visualize the inferred topics:

These topics can be summarized as:

Topic Num (in pyLDAvis)    Topic Labels
1                          Explicit, Violent
2                          Pain, Loss, Anxiety
3                          Light Romance, Passion
4                          Love, Trust
5                          Dance, Party
6                          Spiritual, Natural

Looks much better than the topics we started with.

With these topics we can cluster songs that are similar in their content.

Using the function get_document_topics of Gensim's LDA model, we can get a list of topic probabilities for each song (the probability that it belongs to each of the extracted topics). These can serve as numeric features for our purpose.

Summary

  1. We decided to research the benefit of machine learning in the field of music recommendation.
  2. We introduced the two main concepts of recommendation systems: collaborative filtering and content-based filtering.
  3. We raised the hypothesis that content-based filtering (analyzing audio features & lyrics) can be useful for recommendation/prediction models and can tackle inherent issues of collaborative filtering. Extracting data from lyrics and audio features is also much more interesting in terms of data analysis and ML than collaborative filtering. Therefore, we decided to focus on it.
  4. We fetched data (by scraping and by API calls) from various sources (Spotify playlists, audio features and lyrics of songs in the given playlists).
  5. We analyzed lyrics with two different NLP techniques:
    1. Emotion Recognition - using emotional terms lexicons and VADER.
    2. Topic Modeling - using LDA algorithm.

Future plans

  1. Fetching audio features.
  2. Perhaps implement simple Collaborative Filtering for comparison or for improving results.
    3. In emotion recognition we created a model that can give emotion scores for each song, but it is based only on its lexicons. Perhaps we can train a model on some of these scores in order to score all the remaining songs, so that it may find other emotional words that are not included in the lexicons and yield more accurate scores.
  4. The main plan:
    1. Gather all numeric features for each song: audio features, emotion scores and topics relevance.
    2. Using a machine-learning model (probably unsupervised or a combination with a supervised ANN), make clusters of similar songs based on the features.
    3. Build a predictor that returns similar songs for each user based on the created clusters and his/her given playlist.
    4. For comparison, we can take each group of features (audio, emotions, topics) separately and compare their results. According to these results, we can give different weights to each feature group when we combine them (lower weight to less-important features), and thereby find the optimal model.